Complex networks analysis of language complexity

Authors

  • Diego R. Amancio
  • Sandra M. Aluísio
  • Osvaldo N. Oliveira
  • Luciano da Fontoura Costa
Abstract

Methods from statistical physics, such as those involving complex networks, have been increasingly used in quantitative analysis of linguistic phenomena. In this paper, we represented pieces of text with different levels of simplification in co-occurrence networks and found that topological regularity correlated negatively with textual complexity. Furthermore, in less complex texts the distance between concepts, represented as nodes, tended to decrease. The complex network metrics were treated with multivariate pattern recognition techniques, which allowed us to distinguish between original texts and their simplified versions. For each original text, two simplified versions were generated manually with an increasing number of simplification operations. As expected, distinction was easier for the strongly simplified versions, where the most relevant metrics were node strength, shortest paths and diversity. Also, the discrimination of complex texts was improved with higher hierarchical network metrics, thus pointing to the usefulness of considering wider contexts around the concepts. Though the accuracy rate in the distinction was not as high as in methods using deep linguistic knowledge, the complex network approach is still useful for a rapid screening of texts whenever assessing complexity is essential to guarantee accessibility to readers with limited reading ability.

Introduction. – Statistical physics has been applied in the analysis of a variety of phenomena from social sciences and linguistics [1–4], in many cases permitting unprecedented interpretation based on quantitative measurements [5, 6]. Of particular importance for the present study has been the use of complex networks in treating linguistic issues [1,2,7–9], including those associated with natural language processing (NLP) tasks [10–13].
One normally exploits the finding that networks derived from text exhibit a scale-free topology, where the degree distribution follows a power law, regarded as a consequence of the rich-get-richer paradigm [14]. Examples of applications of network concepts in NLP include strategies for automatic summarization [10], evaluation of machine translations [11, 12], analysis of lexical resources [8], language evolution [7] and authorship recognition [13].

In this paper, we combine metrics extracted from networks representing text with pattern recognition methods to investigate complexity in texts. Motivation for this endeavor came from the need to assess whether written material on the Internet is accessible to widespread communities, including people with low levels of education. This is especially important for countries such as Brazil, for which official figures revealed that in 2009 7% of the population were classified as illiterate; 21% as literate at the rudimentary level; 47% as literate at the basic level; and only 25% as literate at the advanced level.

Concerted efforts have been made to develop methods to detect and simplify complex textual structures, with the aim of making information accessible to people who are not able to read complex texts. Although the implementation of simplification strategies is the most important item in applications requiring text simplification, before applying any technique one should first identify the pieces of text considered complex. Most importantly, one has to unveil the characteristics that make a text difficult to read. Indeed, many approaches have been developed to quantify textual complexity [15], but there is still no consensus on how complexity can be measured effectively. Here we propose a new approach based on a possible correlation between textual complexity and regularity of network topology.

Footnote 1: http://www.ipm.org.br

arXiv:1302.4490v1 [physics.soc-ph] 19 Feb 2013
More specifically, we apply the methodology described in Refs. [11–13], which combines topological characterization based on descriptors with pattern recognition strategies, to evaluate complexity in texts with distinct levels of simplification. We shall show that considering wider contexts around words in the text is useful for the intended complexity discrimination.

Methodology. – Networks are employed to represent texts, whose topology is examined through several metrics. The patterns emerging from the topological features are investigated with clustering and supervised learning techniques in order to correlate with textual complexity.

Database. The database was developed under the PorSimples project, available online from http://www2.nilc.icmc.usp.br/wiki/index.php/English. All 113 texts were collected from the Brazilian Zero Hora newspaper (footnote 3) and their simplified versions were created by a linguist expert in textual simplification. For each original (non-simplified) text there are two corresponding simplified versions, which differ from each other by the number of operations applied to simplify the text (see the list of possible operations in Table S2 of the SI). The first one, referred to as natural simplification, was obtained with only a few simplification operations. In the version obtained with the procedure referred to as strong simplification, all the possible simplification operations were performed. The statistics related to the three corpora and examples of original texts and simplifications are given in Figure S1 and in Table S3 of the SI, respectively.

Network Formation. To model texts as complex networks, preprocessing steps were applied. Stopwords were eliminated and the remaining words were lemmatized. That is to say, words were converted to their singular (nouns) and infinitive forms (verbs), so that words with different inflections but related to the same concept were taken as a single node in the network.
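The preprocessing just described (stopword removal followed by lemmatization) can be illustrated with a minimal Python sketch. The stopword set and the lemma table below are tiny hypothetical stand-ins in English, chosen only for illustration; the study itself worked on Portuguese texts and relied on a part-of-speech tagger rather than a fixed lookup table.

```python
# Toy resources: illustrative assumptions, not the lists used in the study.
STOPWORDS = {"the", "a", "of", "in", "and", "were", "to"}
LEMMAS = {
    "networks": "network",
    "texts": "text",
    "represented": "represent",
    "represents": "represent",
}


def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, map words to lemmas."""
    tokens = [word.strip(".,;:!?()\"'").lower() for word in text.split()]
    return [LEMMAS.get(tok, tok) for tok in tokens if tok and tok not in STOPWORDS]


sample = "The texts were represented in networks, and a network represents the text."
tokens = preprocess(sample)
print(tokens)
```

Inflected forms such as "texts" and "text" collapse to the same token, so they would later map to a single node.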
To perform this conversion, ambiguities were resolved by using the MXPost part-of-speech tagger, based on Ratnaparkhi's model [16]. Then, each word in the pre-processed text was represented as a node and edges were established depend...

Footnote 2: Some approaches are presented in the Supplementary Information (SI), available from https://dl.dropbox.com/u/2740286/supplementary.pdf
Footnote 3: http://www.zerohora.com.br
Footnote 4: Stopwords are very frequent words usually conveying little semantic meaning, such as articles and prepositions.
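As a rough illustration of a co-occurrence network and of two metrics named in the abstract (node strength and shortest paths), the following Python sketch may help. The edge rule is cut off in this extract, so the sketch assumes, purely for illustration, that an undirected edge links every pair of adjacent words and that its weight counts how often the pair occurs. The function names are hypothetical and the shortest path is measured in hops, ignoring weights.

```python
from collections import defaultdict, deque


def build_network(tokens):
    """Undirected weighted graph; weight = number of adjacent occurrences (assumed rule)."""
    graph = defaultdict(lambda: defaultdict(int))
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # ignore self-loops produced by repeated words
            graph[left][right] += 1
            graph[right][left] += 1
    return graph


def node_strength(graph, node):
    """Sum of the weights of all edges attached to the node."""
    return sum(graph.get(node, {}).values())


def shortest_path_length(graph, source, target):
    """Breadth-first search hop count; returns None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, {}):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


tokens = ["network", "represent", "text", "complexity", "text", "network"]
graph = build_network(tokens)
print(node_strength(graph, "text"))
print(shortest_path_length(graph, "represent", "complexity"))
```

In this toy sequence, "text" is adjacent to "complexity" twice, so that edge carries weight 2 and contributes twice to the strength of both nodes.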

Similar resources

The Effects of Task Complexity on Input-Driven Uptake of Salient Linguistic Features

The present study investigated the effects of cognitive complexity of pedagogical tasks on the learners’ uptake of salient features in the input. For the purpose of data collection, three versions of a decision-making task (simple, mid, and complex) were employed. Three intact classes (each 20 language learners) were randomly assigned to three groups.  Each group transacted a version of a decis...

The Effect of Task Complexity on EFL Learners’ Narrative Writing Task Performance

This study examined the effects of task complexity on written narrative production under different task complexity conditions by EFL learners at different proficiency levels. Task complexity was manipulated along Robinson’s (2001b) proposed task complexity dimension of Here-and-Now (simple) vs. There-and-Then (complex). Accordingly, three specific measures of the written narratives were targ...

Clause Complexity in Applied Linguistics Research Article Abstracts by Native and Non-Native English Writers: Taxis, Expansion and Projection

Halliday’s Systemic Functional Linguistics (SFL) has stood the test of time as a model of text analysis. The present literature contains a plethora of studies that while taking the ‘clause’ as a unit of analysis have put into investigation the metafunctions in research articles of a single field of study or those of various fields in comparison. Although ‘clause complex’ is another unit of SF a...

The Effect of Task Sequencing on the Writing Fluency of English as Foreign Language Learners

This study investigated the effect of sequencing tasks from simple to complex along +/- reasoning demands on fluency in writing task performance of English as Foreign Language (EFL) learners. The participants of this study included 90 intermediate EFL learners from three intact class divisions at the Islamic Azad University, Shahr-e-Qods Branch. They were distributed in three groups: Experime...

Cognitive Task Complexity and Iranian EFL Learners’ Written Linguistic Performance across Writing Proficiency Levels

Recently tasks, as the basic units of syllabi, and the cognitive complexity, as the criterion for sequencing them, have caught many second language researchers’ attention. This study sought to explore the effect of utilizing the cognitively simple and complex tasks on high- and low-proficient EFL Iranian writers’ linguistic performance, i.e., fluency, accuracy, lexical complexity, and structura...

An Investigation into the Effects of Joint Planning on Complexity, Accuracy, and Fluency across Task Complexity

The current study aimed to examine the effects of strategic planning, online planning, strategic planning and online planning combined (joint planning), and no planning on the complexity, accuracy, and fluency of oral productions in two simple and complex narrative tasks. Eighty advanced EFL learners performed one simple narrative task and a complex narrative task with 20 minutes in between. Th...

Journal title:
  • CoRR

Volume abs/1302.4490

Publication date 2013